Problem Statement -

Build your own recommendation system for products on an e-commerce website like Amazon.com. Online e-commerce websites like Amazon and Flipkart use different recommendation models to provide different suggestions to different users.

Amazon currently uses item-to-item collaborative filtering, which scales to massive datasets and produces high-quality recommendations in real time. This type of filtering matches each of the user's purchased and rated items to similar items, then combines those similar items into a recommendation list for the user. In this project we are going to build a recommendation model for the electronics products on Amazon. The dataset is taken from the website below. Source - Amazon Reviews data (http://jmcauley.ucsd.edu/data/amazon/). The repository has several datasets; for this case study, we are using the Electronics dataset.
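To make the idea concrete, here is a minimal, self-contained sketch of item-to-item similarity on hypothetical data (the item names and user sets below are made up): two items are considered similar when the sets of users who rated them overlap, measured by the Jaccard index - the same measure the item-similarity model later in this notebook uses.

```python
# Toy item -> set-of-raters mapping (hypothetical data for illustration only)
item_users = {
    'item_A': {'u1', 'u2', 'u3'},
    'item_B': {'u2', 'u3', 'u4'},
    'item_C': {'u5'},
}

def jaccard(a, b):
    """Jaccard index: |intersection| / |union| of two user sets."""
    return len(a & b) / len(a | b)

sim_ab = jaccard(item_users['item_A'], item_users['item_B'])  # 2/4 = 0.5
sim_ac = jaccard(item_users['item_A'], item_users['item_C'])  # 0/6 = 0.0
```

Items A and B share two of four distinct raters, so they are moderately similar; A and C share none, so they are not.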

Dataset columns - the first three columns are userId, productId, and ratings; the fourth column is timestamp. You can discard the timestamp column, as it is not needed in this case.

License

@Misc{Surprise,
author = {Hug, Nicolas},
title = { {S}urprise, a {P}ython library for recommender systems},
howpublished = {\url{http://surpriselib.com}},
year = {2017}
}

1. Read and explore the given dataset. ( Rename column/add headers, plot histograms, find data characteristics)

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from plotly.offline import init_notebook_mode, plot, iplot
import plotly.graph_objs as go
init_notebook_mode(connected=True)

from surprise import SVD
from surprise.model_selection import cross_validate

%matplotlib inline
In [2]:
df = pd.read_csv('ratings_Electronics.csv',names=['userId','productId','ratings','timestamp'])
In [3]:
df.head()
Out[3]:
userId productId ratings timestamp
0 AKM1MP6P0OYPR 0132793040 5.0 1365811200
1 A2CX7LUOHB2NDG 0321732944 5.0 1341100800
2 A2NWSAGRHCP8N5 0439886341 1.0 1367193600
3 A2WNBOD3WNDNKT 0439886341 3.0 1374451200
4 A1GI0U4ZRJA8WN 0439886341 1.0 1334707200
In [4]:
df = df.drop('timestamp', axis=1)
In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7824482 entries, 0 to 7824481
Data columns (total 3 columns):
userId       object
productId    object
ratings      float64
dtypes: float64(1), object(2)
memory usage: 179.1+ MB
In [6]:
df.describe()
Out[6]:
ratings
count 7.824482e+06
mean 4.012337e+00
std 1.380910e+00
min 1.000000e+00
25% 3.000000e+00
50% 5.000000e+00
75% 5.000000e+00
max 5.000000e+00
In [7]:
print("Number of NaN values in our dataframe : ", df.isnull().sum().sum())
Number of NaN values in our dataframe :  0
In [8]:
dup_bool = df.duplicated(['userId','productId','ratings'])
dups = sum(dup_bool) # timestamp was already dropped, so this covers all remaining columns
print("There are {} duplicate rating entries in the data..".format(dups))
There are 0 duplicate rating entries in the data..
In [9]:
print("Total data ")
print("-"*50)
print("\nTotal no of ratings :",df.shape[0])
print("Total No of Users   :", len(np.unique(df.userId)))
print("Total No of Products  :", len(np.unique(df.productId)))
Total data 
--------------------------------------------------

Total no of ratings : 7824482
Total No of Users   : 4201696
Total No of Products  : 476002
In [10]:
p = df.groupby('ratings')['ratings'].agg(['count'])

# get products count
prod_count = df['productId'].nunique()

# get users count
user_count = df['userId'].nunique()

# get rating count
rating_count = df.shape[0]

ax = p.plot(kind = 'barh', legend = False, figsize = (15,10))
plt.title('Total pool: {:,} Products, {:,} users, {:,} ratings given'.format(prod_count, user_count, rating_count), fontsize=20)
plt.axis('off')

for i in range(1,6):
    ax.text(p.iloc[i-1][0]/4, i-1, 'Rating {}: {:.0f}%'.format(i, p.iloc[i-1][0]*100 / p.sum()[0]), color = 'white', weight = 'bold')

From this we can see that 56% of all ratings in the data are 5, and very few ratings are 2 or 3; products with low ratings are generally really bad.
Most people rate a product only if it is really bad (rating 1: 12%) or very good (rating 4 or 5).
Most people don't normally give an average (2/3) rating.
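The percentages in the bar chart can also be reproduced directly with pandas. Shown here on a hypothetical mini-frame; running the same two lines on the full `df` yields the 56% / 12% figures discussed above.

```python
import pandas as pd

# Hypothetical mini-frame standing in for the full ratings column
df = pd.DataFrame({'ratings': [5.0, 5.0, 5.0, 1.0, 4.0]})

# Share of each rating value as a percentage
rating_share = df['ratings'].value_counts(normalize=True).sort_index() * 100
```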

In [11]:
no_of_rated_products_per_user = df.groupby(by='userId')['ratings'].count().sort_values(ascending=False)

no_of_rated_products_per_user.head()
Out[11]:
userId
A5JLAU2ARJ0BO     520
ADLVFFE4VBT8      501
A3OXHLG6DIBRW8    498
A6FIAB28IS79      431
A680RUE1FDO8B     406
Name: ratings, dtype: int64
In [12]:
# Number of ratings per user
data = df.groupby('userId')['ratings'].count().clip(upper=50)

# Create trace
trace = go.Histogram(x = data.values,
                     name = 'ratings',
                     xbins = dict(start = 0,
                                  end = 50,
                                  size = 2))
# Create layout
layout = go.Layout(title = 'Distribution Of Number of Ratings Per User (Clipped at 50)',
                   xaxis = dict(title = 'Ratings Per User'),
                   yaxis = dict(title = 'Count'),
                   bargap = 0.2)

# Create plot
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)
In [13]:
data = df.groupby('productId')['ratings'].count().clip(upper=50)

# Create trace
trace = go.Histogram(x = data.values,
                     name = 'ratings',
                     xbins = dict(start = 0,
                                  end = 50,
                                  size = 2))
# Create layout
layout = go.Layout(title = 'Distribution Of Number of Ratings Per Product (Clipped at 50)',
                   xaxis = dict(title = 'Number of Ratings Per Product'),
                   yaxis = dict(title = 'Count'),
                   bargap = 0.2)

# Create plot
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)

It is very skewed, just like the number of ratings given per user.
- There are some products (the very popular ones) which are rated by a huge number of users.
- But most of the products get at most a few hundred ratings.
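The skew claim can be quantified by comparing the median and maximum ratings-per-product counts. A sketch on a made-up toy frame (on the real `df`, the same groupby shows a median far below the maximum):

```python
import pandas as pd

# Hypothetical toy frame: one popular product, two long-tail products
df = pd.DataFrame({
    'productId': ['p1']*6 + ['p2']*2 + ['p3'],
    'ratings':   [5, 4, 5, 3, 5, 4, 2, 5, 1],
})

ratings_per_product = df.groupby('productId')['ratings'].count()
# Median vs. max exposes the long tail: typical products get few ratings,
# while a handful collect many.
median_count = ratings_per_product.median()
max_count = ratings_per_product.max()
```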

2. Take a subset of the dataset to make it less sparse/denser. ( For example, keep only the users who have given 50 or more ratings )

In [14]:
# Filter sparse users
min_user_ratings = 50
filter_users = (df['userId'].value_counts() > min_user_ratings)
filter_users = filter_users[filter_users].index.tolist()

# Filter sparse products
min_product_ratings = 50
filter_products = df['productId'].value_counts() > min_product_ratings
filter_products = filter_products[filter_products].index.tolist()

# Actual filtering
df_filterd = df[(df['userId'].isin(filter_users)) & (df['productId'].isin(filter_products))]
del filter_users, min_user_ratings
print('Shape User-Ratings unfiltered:\t{}'.format(df.shape))
print('Shape User-Ratings filtered:\t{}'.format(df_filterd.shape))
Shape User-Ratings unfiltered:	(7824482, 3)
Shape User-Ratings filtered:	(76359, 3)
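One way to quantify what this filtering buys us is matrix density: the fraction of the user x product matrix that actually contains a rating. Using the counts reported elsewhere in this notebook (7,824,482 ratings over 4,201,696 users x 476,002 products before filtering; 76,359 ratings over 1,466 users x 16,555 products after), a quick sketch:

```python
def density(n_ratings, n_users, n_products):
    """Fraction of the user x product matrix that is filled in."""
    return n_ratings / (n_users * n_products)

# Figures taken from this notebook's outputs
before = density(7824482, 4201696, 476002)  # full dataset
after  = density(76359, 1466, 16555)        # filtered dataset
```

The filtered matrix is denser by a factor of several hundred, which is exactly what the neighbourhood-based models need.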

3. Split the data randomly into train and test dataset. ( For example, split it in 70/30 ratio)

In [15]:
df_filterd.iloc[:int(df_filterd.shape[0]*0.70)].to_csv("train.csv", index=False)
df_filterd.iloc[int(df_filterd.shape[0]*0.70):].to_csv("test.csv", index=False)

train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")
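Note that the `iloc` split above is positional rather than random: the first 70% of rows go to train. A genuinely random 70/30 split, as the task suggests, can be sketched by shuffling first (the toy frame and `random_state=42` below are illustrative choices, not values from this notebook):

```python
import pandas as pd

# Toy frame standing in for df_filterd
df_filterd = pd.DataFrame({'userId':    list('aabbccddee'),
                           'productId': list('pqpqpqpqpq'),
                           'ratings':   [5, 4, 3, 5, 2, 1, 4, 5, 3, 2]})

# Shuffle, then cut at 70% so the split is random rather than positional
shuffled = df_filterd.sample(frac=1, random_state=42)
cut = int(len(shuffled) * 0.70)
train_df, test_df = shuffled.iloc[:cut], shuffled.iloc[cut:]
```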
In [16]:
print("Training data ")
print("-"*50)
print("\nTotal no of ratings :",train_df.shape[0])
print("Total No of Users   :", len(np.unique(train_df.userId)))
print("Total No of Products  :", len(np.unique(train_df.productId)))
Training data 
--------------------------------------------------

Total no of ratings : 53451
Total No of Users   : 1466
Total No of Products  : 12340
In [17]:
print("Test data ")
print("-"*50)
print("\nTotal no of ratings :",test_df.shape[0])
print("Total No of Users   :", len(np.unique(test_df.userId)))
print("Total No of Products  :", len(np.unique(test_df.productId)))
Test data 
--------------------------------------------------

Total no of ratings : 22908
Total No of Users   : 1424
Total No of Products  : 4215

4. Build Popularity Recommender model.

In [18]:
#Class for Popularity based Recommender System model
class popularity_recommender_py():
    def __init__(self):
        self.train_data = None
        self.user_id = None
        self.item_id = None
        self.popularity_recommendations = None
        
    #Create the popularity based recommender system model
    def create(self, train_data, user_id, item_id):
        self.train_data = train_data
        self.user_id = user_id
        self.item_id = item_id

        #Get a count of user_ids for each unique product as recommendation score
        train_data_grouped = train_data.groupby([self.item_id]).agg({self.user_id: 'count'}).reset_index()
        train_data_grouped.rename(columns = {self.user_id: 'score'}, inplace=True)
    
        #Sort the product based upon recommendation score
        train_data_sort = train_data_grouped.sort_values(['score', self.item_id], ascending = [0,1])
    
        #Generate a recommendation rank based upon score
        train_data_sort['Rank'] = train_data_sort['score'].rank(ascending=0, method='first')
        
        #Get the top 5 recommendations
        self.popularity_recommendations = train_data_sort.head(5)
        
    
    #Use the popularity based recommender system model to
    #make recommendations
    def recommend(self, user_id):    
        user_recommendations = self.popularity_recommendations.copy()
        
        #Add user_id column for which the recommendations are being generated
        user_recommendations['userId'] = user_id
    
        #Bring user_id column to the front
        cols = user_recommendations.columns.tolist()
        cols = cols[-1:] + cols[:-1]
        user_recommendations = user_recommendations[cols]
        
        return user_recommendations
    

#Class for Item similarity based Recommender System model
class item_similarity_recommender_py():
    def __init__(self):
        self.train_data = None
        self.user_id = None
        self.item_id = None
        self.cooccurence_matrix = None
        self.products_dict = None
        self.rev_products_dict = None
        self.item_similarity_recommendations = None
        
    #Get unique items (products) corresponding to a given user
    def get_user_items(self, user):
        user_data = self.train_data[self.train_data[self.user_id] == user]
        user_items = list(user_data[self.item_id].unique())
        
        return user_items
        
    #Get unique users for a given item (product)
    def get_item_users(self, item):
        item_data = self.train_data[self.train_data[self.item_id] == item]
        item_users = set(item_data[self.user_id].unique())
            
        return item_users
        
    #Get unique items (products) in the training data
    def get_all_items_train_data(self):
        all_items = list(self.train_data[self.item_id].unique())
            
        return all_items
        
    #Construct cooccurence matrix
    def construct_cooccurence_matrix(self, user_products, all_products):
            
        ####################################
        #Get users for all products in user_products.
        ####################################
        user_products_users = []        
        for i in range(0, len(user_products)):
            user_products_users.append(self.get_item_users(user_products[i]))
            
        ###############################################
        #Initialize the item cooccurence matrix of size 
        #len(user_products) X len(products)
        ###############################################
        cooccurence_matrix = np.matrix(np.zeros(shape=(len(user_products), len(all_products))), float)
           
        #############################################################
        #Calculate similarity between user products and all unique products
        #in the training data
        #############################################################
        for i in range(0,len(all_products)):
            #Calculate unique users of product (item) i
            products_i_data = self.train_data[self.train_data[self.item_id] == all_products[i]]
            users_i = set(products_i_data[self.user_id].unique())
            
            for j in range(0,len(user_products)):       
                    
                #Get unique users of product (item) j
                users_j = user_products_users[j]
                    
                #Calculate intersection of users of products i and j
                users_intersection = users_i.intersection(users_j)
                
                #Calculate cooccurence_matrix[i,j] as Jaccard Index
                if len(users_intersection) != 0:
                    #Calculate union of users of products i and j
                    users_union = users_i.union(users_j)
                    
                    cooccurence_matrix[j,i] = float(len(users_intersection))/float(len(users_union))
                else:
                    cooccurence_matrix[j,i] = 0
                    
        
        return cooccurence_matrix

    
    #Use the cooccurence matrix to make top recommendations
    def generate_top_recommendations(self, user, cooccurence_matrix, all_products, user_products):
        print("Non zero values in cooccurence_matrix :%d" % np.count_nonzero(cooccurence_matrix))
        
        #Calculate a weighted average of the scores in cooccurence matrix for all user products.
        user_sim_scores = cooccurence_matrix.sum(axis=0)/float(cooccurence_matrix.shape[0])
        user_sim_scores = np.array(user_sim_scores)[0].tolist()
 
        #Sort the indices of user_sim_scores based upon their value
        #Also maintain the corresponding score
        sort_index = sorted(((e,i) for i,e in enumerate(list(user_sim_scores))), reverse=True)
    
        #Create a dataframe from the following
        columns = ['userId', 'productId', 'score', 'rank']
        #index = np.arange(1) # array of numbers for the number of samples
        df = pd.DataFrame(columns=columns)
         
        #Fill the dataframe with top 5 item based recommendations
        rank = 1 
        for i in range(0,len(sort_index)):
            if ~np.isnan(sort_index[i][0]) and all_products[sort_index[i][1]] not in user_products and rank <= 5:
                df.loc[len(df)]=[user,all_products[sort_index[i][1]],sort_index[i][0],rank]
                rank = rank+1
        
        #Handle the case where there are no recommendations
        if df.shape[0] == 0:
            print("The current user has no products for training the item similarity based recommendation model.")
            return -1
        else:
            return df
 
    #Create the item similarity based recommender system model
    def create(self, train_data, user_id, item_id):
        self.train_data = train_data
        self.user_id = user_id
        self.item_id = item_id

    #Use the item similarity based recommender system model to
    #make recommendations
    def recommend(self, user):
        
        ########################################
        #A. Get all unique products for this user
        ########################################
        user_products = self.get_user_items(user)    
            
        print("No. of unique products for the user: %d" % len(user_products))
        
        ######################################################
        #B. Get all unique items (products) in the training data
        ######################################################
        all_products = self.get_all_items_train_data()
        
        print("no. of unique products in the training set: %d" % len(all_products))
         
        ###############################################
        #C. Construct item cooccurence matrix of size 
        #len(user_products) X len(products)
        ###############################################
        cooccurence_matrix = self.construct_cooccurence_matrix(user_products, all_products)
        
        #######################################################
        #D. Use the cooccurence matrix to make recommendations
        #######################################################
        df_recommendations = self.generate_top_recommendations(user, cooccurence_matrix, all_products, user_products)
                
        return df_recommendations
    
    #Get similar items to given items
    def get_similar_items(self, item_list):
        
        user_products = item_list
        
        ######################################################
        #B. Get all unique items (products) in the training data
        ######################################################
        all_products = self.get_all_items_train_data()
        
        print("no. of unique products in the training set: %d" % len(all_products))
         
        ###############################################
        #C. Construct item cooccurence matrix of size 
        #len(user_products) X len(products)
        ###############################################
        cooccurence_matrix = self.construct_cooccurence_matrix(user_products, all_products)
        
        #######################################################
        #D. Use the cooccurence matrix to make recommendations
        #######################################################
        user = ""
        df_recommendations = self.generate_top_recommendations(user, cooccurence_matrix, all_products, user_products)
         
        return df_recommendations
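The nested loops in `construct_cooccurence_matrix` recompute user sets for every (i, j) pair, which gets slow at this scale. As a sketch (on a made-up toy frame, not this notebook's data), the same Jaccard cooccurrence can be computed in one shot with a binary user x item matrix and a matrix product:

```python
import pandas as pd

# Toy ratings frame: u1 and u2 rated both products, u3 rated only pB
ratings = pd.DataFrame({'userId':    ['u1', 'u1', 'u2', 'u2', 'u3'],
                        'productId': ['pA', 'pB', 'pA', 'pB', 'pB']})

# Binary user x item matrix: 1 if the user rated the item
ui = pd.crosstab(ratings['userId'], ratings['productId']).clip(upper=1)

inter  = ui.T.values @ ui.values                   # co-rating counts |A ∩ B|
counts = ui.sum(axis=0).values                     # raters per item
union  = counts[:, None] + counts[None, :] - inter # |A| + |B| - |A ∩ B|
jaccard = inter / union                            # item x item Jaccard matrix
```

On the toy data, pA and pB share 2 of 3 distinct raters, so their Jaccard similarity is 2/3; each item's similarity to itself is 1.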
In [19]:
pm = popularity_recommender_py()
In [20]:
pm.create(train_df, 'userId', 'productId')
In [21]:
users = df_filterd['userId'].unique()
len(users)
Out[21]:
1466
In [22]:
products = df_filterd['productId'].unique()
len(products)
Out[22]:
16555
In [23]:
user_id = users[20]
In [24]:
pm.recommend(user_id)
Out[24]:
userId productId score Rank
8296 A2JOPUWVV0XQJ3 B003ES5ZUU 177 1.0
3203 A2JOPUWVV0XQJ3 B000N99BBC 163 2.0
7192 A2JOPUWVV0XQJ3 B002R5AM7C 127 3.0
9753 A2JOPUWVV0XQJ3 B004CLYEDC 117 4.0
7281 A2JOPUWVV0XQJ3 B002SZEOLG 108 5.0

Item Similarity

In [25]:
is_model = item_similarity_recommender_py()
In [26]:
is_model.create(train_df, 'userId', 'productId')
In [27]:
user_id = users[20]
In [28]:
user_items = is_model.get_user_items(user_id)
In [29]:
#Recommend products for the user using personalized model
is_model.recommend(user_id)
No. of unique products for the user: 33
no. of unique products in the training set: 12340
Non zero values in cooccurence_matrix :24322
Out[29]:
userId productId score rank
0 A2JOPUWVV0XQJ3 1400501776 0.019577 1
1 A2JOPUWVV0XQJ3 B004CLYEDC 0.019445 2
2 A2JOPUWVV0XQJ3 B003ES5ZUU 0.018321 3
3 A2JOPUWVV0XQJ3 B004CLYEE6 0.017083 4
4 A2JOPUWVV0XQJ3 B005CLPP84 0.016861 5
In [30]:
is_model.get_similar_items([20])  # note: 20 is not a valid productId, hence the zero similarity scores below
no. of unique products in the training set: 12340
Non zero values in cooccurence_matrix :0
Out[30]:
userId productId score rank
0 B006DNXE24 0.0 1
1 B006DKEUWK 0.0 2
2 B006DKEUAM 0.0 3
3 B006DKEQL0 0.0 4
4 B006DEBYWU 0.0 5

5. Build Collaborative Filtering model.

In [31]:
from surprise import Dataset,Reader
reader = Reader(rating_scale=(1, 5))
In [32]:
data = Dataset.load_from_df(df_filterd[['userId', 'productId', 'ratings']], reader)
In [33]:
from surprise import KNNWithMeans,SVD, SVDpp, SlopeOne, NMF, NormalPredictor, KNNBaseline, KNNBasic, KNNWithMeans, KNNWithZScore, BaselineOnly, CoClustering
from surprise import accuracy
from surprise.model_selection import train_test_split
In [34]:
benchmark = []
# Iterate over all algorithms
for algorithm in [SVD(), SVDpp(), SlopeOne(), NMF(), NormalPredictor(), KNNBaseline(), KNNBasic(), KNNWithMeans(), KNNWithZScore(), BaselineOnly(), CoClustering()]:
    # Perform cross validation
    results = cross_validate(algorithm, data, measures=['RMSE'], cv=3, verbose=False)
    
    # Get results & append algorithm name
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    tmp['Algorithm'] = type(algorithm).__name__  # Series.append is deprecated; add a label instead
    benchmark.append(tmp)
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...

6. Evaluate the models. ( Once the model is trained on the training data, it can be used to compute the error (RMSE) on predictions made on the test data.)

In [35]:
pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')
Out[35]:
test_rmse fit_time test_time
Algorithm
BaselineOnly 0.954307 0.139307 0.167565
SVD 0.959014 2.711078 0.163220
SVDpp 0.960932 31.248463 1.357708
KNNBaseline 1.039688 0.214750 0.700147
KNNWithMeans 1.045905 0.120012 0.542892
KNNWithZScore 1.050875 0.170540 0.735017
CoClustering 1.061517 2.350042 0.180850
SlopeOne 1.082503 3.942983 0.705801
KNNBasic 1.102813 0.087771 0.525259
NMF 1.135795 4.170515 0.154586
NormalPredictor 1.318443 0.071806 0.127649

The BaselineOnly algorithm gave us the best RMSE, so we will train and predict with BaselineOnly, using Alternating Least Squares (ALS) to estimate the baselines.
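For context, BaselineOnly predicts a rating as r̂_ui = μ + b_u + b_i: the global mean plus a learned user bias and item bias, with ALS alternating between estimating the b_u and the b_i terms. A toy calculation with made-up bias values (only μ roughly matches the `df.describe()` output earlier):

```python
# BaselineOnly prediction rule: global mean + user bias + item bias
mu  = 4.01   # global mean rating (rounded from df.describe() above)
b_u = -0.30  # hypothetical bias of a harsh user
b_i = +0.52  # hypothetical bias of a well-liked product

prediction = mu + b_u + b_i               # 4.23
clipped = min(max(prediction, 1.0), 5.0)  # keep inside the 1-5 rating scale
```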

In [36]:
print('Using ALS')
bsl_options = {'method': 'als',
               'n_epochs': 5,
               'reg_u': 12,
               'reg_i': 5
               }
algo = BaselineOnly(bsl_options=bsl_options)
cross_validate(algo, data, measures=['RMSE'], cv=3, verbose=False)
Using ALS
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Out[36]:
{'test_rmse': array([0.95018106, 0.95710772, 0.95583848]),
 'fit_time': (0.07381439208984375, 0.07604384422302246, 0.07683205604553223),
 'test_time': (0.10469555854797363, 0.09773826599121094, 0.17355060577392578)}
In [37]:
trainset, testset = train_test_split(data, test_size=0.30)

predictions = algo.fit(trainset).test(testset)
accuracy.rmse(predictions)
Estimating biases using als...
RMSE: 0.9641
Out[37]:
0.9641416318653331
In [38]:
len(testset)
Out[38]:
22908
In [39]:
testset[0:5]
Out[39]:
[('A3A4ZAIBQWKOZS', 'B00CKK8GEU', 5.0),
 ('AWHL379EE14K7', 'B002UT42UI', 5.0),
 ('A250AXLRBVYKB4', 'B000CR78C4', 4.0),
 ('A3F7F7QKQP2FKT', 'B002HQUIVQ', 5.0),
 ('A3N4I2KRSMACW8', 'B00CBQNB7K', 4.0)]

7. Get top - K ( K = 5) recommendations. Since our goal is to recommend new products to each user based on his/her habits, we will recommend 5 new products.

In [40]:
def get_Iu(uid):
    """ return the number of items rated by given user
    args: 
      uid: the id of the user
    returns: 
      the number of items rated by the user
    """
    try:
        return len(trainset.ur[trainset.to_inner_uid(uid)])
    except ValueError: # user was not part of the trainset
        return 0
    
def get_Ui(iid):
    """ return number of users that have rated given item
    args:
      iid: the raw id of the item
    returns:
      the number of users that have rated the item.
    """
    try: 
        return len(trainset.ir[trainset.to_inner_iid(iid)])
    except ValueError:
        return 0
    
dframe = pd.DataFrame(predictions, columns=['uid', 'iid', 'rui', 'est', 'details'])
dframe['Iu'] = dframe.uid.apply(get_Iu)
dframe['Ui'] = dframe.iid.apply(get_Ui)
dframe['err'] = abs(dframe.est - dframe.rui)
best_predictions = dframe.sort_values(by='err')[:5]
worst_predictions = dframe.sort_values(by='err')[-5:]
In [41]:
dframe[(dframe['uid']=='A2JOPUWVV0XQJ3')].sort_values(by='err')[:5]
Out[41]:
uid iid rui est details Iu Ui err
4574 A2JOPUWVV0XQJ3 B0096T97OG 4.0 3.967371 {'was_impossible': False} 31 7 0.032629
19105 A2JOPUWVV0XQJ3 B005LS2FS6 4.0 4.045982 {'was_impossible': False} 31 2 0.045982
11900 A2JOPUWVV0XQJ3 B00AAKLE00 4.0 3.912575 {'was_impossible': False} 31 2 0.087425
20326 A2JOPUWVV0XQJ3 B0078FBX24 4.0 3.878310 {'was_impossible': False} 31 0 0.121690
11731 A2JOPUWVV0XQJ3 B0045I8E42 4.0 3.878310 {'was_impossible': False} 31 0 0.121690
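The cells above examine per-user prediction error; to actually produce top-5 recommendation lists from the Surprise predictions, the usual pattern (as in the Surprise FAQ) is to group predictions by user and keep the highest estimates. A self-contained sketch on toy prediction tuples:

```python
from collections import defaultdict

def get_top_n(predictions, n=5):
    """Group predictions by user and keep the n highest estimated ratings.
    `predictions` holds (uid, iid, true_rating, estimated_rating, details)
    tuples, matching the shape of Surprise's prediction objects."""
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]
    return top_n

# Toy predictions for one hypothetical user
preds = [('u1', 'pA', 4.0, 3.9, {}), ('u1', 'pB', 5.0, 4.7, {}),
         ('u1', 'pC', 3.0, 4.2, {})]
top = get_top_n(preds, n=2)  # {'u1': [('pB', 4.7), ('pC', 4.2)]}
```

On the real `predictions` list from `algo.fit(trainset).test(testset)`, calling `get_top_n(predictions, n=5)` gives 5 recommendations per user.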

Recommendations from all models are shown in the image below.

recommendation.png

We do not see any commonality across the three algorithms, which makes sense: the popularity model recommends popular items, which may or may not match a given user's behaviour.

Some Interesting Insights

In [42]:
best_predictions
Out[42]:
uid iid rui est details Iu Ui err
18420 A1MZL91Z44RN06 B002V8C3W2 5.0 5.0 {'was_impossible': False} 53 27 0.0
20408 A1G650TTTHEAL5 B00A35WYBA 5.0 5.0 {'was_impossible': False} 41 11 0.0
20431 A1C5WS021EL3WO B001TH7T2U 5.0 5.0 {'was_impossible': False} 49 30 0.0
13303 A13WOT3RSXKRD5 B002WE6D44 5.0 5.0 {'was_impossible': False} 31 72 0.0
22216 A5CDMTW6JKV5G B0019EHU8G 5.0 5.0 {'was_impossible': False} 22 56 0.0

The above are the best predictions: in each case a significant number of users have rated the target product, and the estimate matches the true rating exactly.

In [43]:
worst_predictions
Out[43]:
uid iid rui est details Iu Ui err
3645 A16SRDVPBXN69C B000YBH4YU 1.0 4.846124 {'was_impossible': False} 33 6 3.846124
5434 A1X1CEGHTHMBL1 B000U8HBRW 1.0 4.861830 {'was_impossible': False} 68 2 3.861830
8359 A39K52QDP4C3ZS B002HK5AW4 1.0 4.871073 {'was_impossible': False} 20 5 3.871073
620 A1KKE6VX8VPWZK B000U62N1S 1.0 4.923403 {'was_impossible': False} 49 4 3.923403
21746 ACQYIC13JXAOI B00IVPU7DG 1.0 5.000000 {'was_impossible': False} 36 10 4.000000

The worst predictions look quite surprising. Let's look in more detail at one of them, item id 'B001TH7GUU'.
The product was rated by 52 users; user A3QDMDSANPYGUX rated it 1, while our BaselineOnly algorithm predicts this user would rate it 5.

In [44]:
df_filterd.loc[df_filterd['productId'] == 'B001TH7GUU']['ratings'].hist()
plt.xlabel('ratings')
plt.ylabel('Number of ratings')
plt.title('Number of ratings product ID B001TH7GUU has received')
plt.show();

It turns out that most of the ratings this product received were 5; in other words,
most of the users in the data rated this product 5, and only very few rated it 1.
The same holds for the other predictions in the "worst predictions" list.
It seems that for each of these predictions, the user in question is some kind of outlier.

8. Summarise your insights.

We went from a popularity model to an item similarity model, and then evaluated multiple collaborative filtering methods.
We found that for this dataset, the collaborative BaselineOnly model performed best on RMSE.
Of course this is just a baseline; there are more possibilities for improvement, for example combining content-based and collaborative filtering models.
In this way, two or more techniques can be combined into a hybrid recommendation engine, improving overall recommendation accuracy and power.

In [ ]: